[SPARK-44856][PYTHON] Improve Python UDTF Arrow serializer performance #50099
HyukjinKwon wants to merge 2 commits into apache:master
Conversation
Why do we need to check both error classes?
The problem is that, previously, we were able to create a pandas DataFrame with a different schema, but Arrow does not seem to allow it.
More specifically, the previous code path:

    yield verify_result(
        pd.DataFrame(check_return_value(res))
    ), arrow_return_type

did not throw an error at pd.DataFrame(...).
However, the new code path:

    convert_to_arrow(func())
    ret = LocalDataToArrowConversion.convert(
        data, return_type, prefers_large_var_types
    ).to_batches()

throws an error at LocalDataToArrowConversion.convert(...).
We could further improve this, but I would prefer to keep that out of scope for this PR, considering that there are already behavior differences with/without Arrow.
Let me add a legacy conf ... to be safer ..
Could you fix the remaining failures?
dongjoon-hyun left a comment
+1, LGTM. Thank you again for making it work, @HyukjinKwon .
python/pyspark/worker.py
Can we move this to before the above line?
Actually no .. otherwise, the tests fail. There is a test that throws an exception when the batch is empty. LocalDataToArrowConversion.convert checks whether the data is empty, and the exception would be wrapped by PySparkRuntimeError.
python/pyspark/worker.py
Can this return multiple batches? What happens in that case?
Yeah, if it grows over the default size (https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html#pyarrow.dataset.Dataset.to_batches), it can be multiple batches. It should work, though - I wrote the code so that it works via ArrowStreamUDFSerializer.dump_stream.
python/pyspark/worker.py
IIRC, it must return exactly one batch per input row. convert_to_arrow should always return one batch. cc @allisonwang-db
From what I read of the code, it seems fine... but it would be great if we could confirm this.
For direct UDTF usage like MyUDTF(lit(1), lit(2)), it's fine to return multiple batches, but for lateral joins like SELECT * FROM t, LATERAL MyUDTF(a, b), we must match each input row with all output of the UDTF for that row. If we return multiple batches, then we can't distinguish which batch to join with which input row.
Just checked: https://github.com/apache/arrow/blob/d2ddee62329eb711572b4d71d6380673d7f7edd1/cpp/src/arrow/table.cc#L612-L638
The batch size will be long max by default, which I believe is pretty safe. An Arrow batch cannot contain more rows than fit in a long in any way.
Co-authored-by: Allison Wang <allison.wang@databricks.com>
Merged to master.
…rk Connect compatibility test

### What changes were proposed in this pull request?
This PR proposes to skip ArrowUDTFParityTests in the Spark Connect compatibility test for now.

### Why are the changes needed?
After #50099, the compatibility test fails: https://github.com/apache/spark/actions/runs/14959668798/job/42019945629
In fact, UDTF with Arrow is still under development, so we can skip the tests for now.

### Does this PR introduce _any_ user-facing change?
No, test-only.

### How was this patch tested?
Will monitor the build.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #50856 from HyukjinKwon/SPARK-44856-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR removes the pandas <> Arrow <> pandas conversion in Arrow-optimized Python UDTFs by directly using PyArrow.

### Why are the changes needed?
Currently, there is a lot of overhead in the Arrow serializer for Python UDTFs. The overhead is largely from converting Arrow batches into pandas Series and converting the UDTF's results back to a pandas DataFrame. We should try directly converting Python objects into Arrow and vice versa to avoid the expensive pandas conversion.

### Does this PR introduce _any_ user-facing change?
Yes. Previously the conversion was

### How was this patch tested?
Existing tests.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#50099 from HyukjinKwon/SPARK-44856.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
What changes were proposed in this pull request?
This PR removes the pandas <> Arrow <> pandas conversion in Arrow-optimized Python UDTFs by directly using PyArrow.
Why are the changes needed?
Currently, there is a lot of overhead in the Arrow serializer for Python UDTFs. The overhead is largely from converting Arrow batches into pandas Series and converting the UDTF's results back to a pandas DataFrame.
We should try directly converting Python objects into Arrow and vice versa to avoid the expensive pandas conversion.
Does this PR introduce any user-facing change?
Yes. Previously the conversion was
How was this patch tested?
Existing tests.
Was this patch authored or co-authored using generative AI tooling?
No.